1. Introduction


One of the fundamental tasks in bioinformatics is sequence comparison, which is used heavily in database searching, sequence classification, phylogenetic tree reconstruction and detection of regulatory sequences. In most cases, the target sequences are aligned by dynamic programming techniques and the resulting alignment scores are used to calculate a measure of similarity. Meanwhile, especially in recent years, an increasing number of alignment-free methods have emerged [1-4]. In contrast to traditional alignments, these alignment-free methods mostly (i) make few assumptions about the evolutionary model and (ii) impose a light computational load. Owing to the first merit, alignment-free methods do not suffer greatly from evolutionary events such as large rearrangements and transposon activity. The second merit makes alignment-free comparisons well suited to pre-filtering relevant sequences, after which alignment algorithms refine the search. This type of heuristic approach is already used in programs like BLAST [5] and FASTA [6]. Additionally, after the completion of many genome projects, alignment-free comparisons have begun to find use in whole-genome phylogeny, which poses great computational and theoretical challenges for alignment-based methods.
In this paper, we propose two metrics to compare DNA and protein sequences based on a Poisson model of word occurrences. Instead of comparing the frequencies of all fixed-length words in two sequences, we consider (1) the probability of 'generating' one sequence under the Poisson model estimated from the other and (2) the different expression levels of words in the two sequences. Phylogenetic trees of 25 viruses including SARS-CoVs are constructed to illustrate our approach.


© 2008 Elsevier Inc. All rights reserved. 


Sequence comparison based on word statistics may be the most well-developed alignment-free method. Observing that the relative abundances of all dinucleotides are remarkably constant across a genome, Karlin et al. [7-9] proposed the 'genome signature' to describe a genome. The signature consists of the array of dinucleotide relative abundances $\rho_{xy} = f_{xy}/(f_x f_y)$ extended over all dinucleotides, where $f_x$ is the frequency of nucleotide $x$ and $f_{xy}$ is the frequency of dinucleotide $xy$. In the same manner, a genome signature based on the abundances of k-nucleotides can also be defined. Reinert et al. [10] studied the statistical and probabilistic properties of words in sequences, with emphasis on the derivation of exact distributions and the evaluation of their asymptotic approximations. Word-based comparisons were recently reviewed by Vinga and Almeida [2]. According to their work, biological sequences are first represented as frequency vectors in Euclidean space, and pairwise distances between sequences can then be defined as the standard Euclidean distance, Mahalanobis distance, linear correlation coefficient or Kullback-Leibler discrepancy between the corresponding vectors. As another powerful tool for sequence analysis, some graphical representations of DNA or protein sequences are also based on statistics of short words [11,12].

In this paper, we propose two distance measures for biological sequences on the basis of word statistics. Instead of comparing the frequencies or relative compositions of each word type in two sequences, we explore two measures in a probabilistic framework. Some basic concepts and our computational methods are introduced in the following section. To illustrate our method, in Section 3 similarity trees of 25 virus genomes are built using some classical distances and our methods.


2. Methods 


A sequence $S$ of length $l$ is defined as a linear succession of symbols from a finite alphabet $\mathcal{A}$ of size $n$. A k-word (or k-mer, k-tuple, etc.) $\omega = \alpha_1\alpha_2\cdots\alpha_k$ is a subsequence of $k$ adjacent letters, $\alpha_i \in \mathcal{A}$, $i = 1, 2, \ldots, k$. Obviously, there are a total of $n^k$ possible k-words for the alphabet $\mathcal{A}$. The occurrence of $\omega$ (denoted by $N_\omega$) is the number of times it is seen when a window of width $k$ slides once across the sequence, and the frequency $f_\omega$ of this word is obtained by dividing its occurrence by the total number of words (i.e., $f_\omega = N_\omega/(l-k+1)$). Given a symbol sequence, we can represent it as a point in a high-dimensional Euclidean space by mapping $S$ to the vector of its word counts or frequencies:

$$N(S) = (N_{\omega_1}, N_{\omega_2}, \ldots, N_{\omega_{n^k}}) \quad \text{or} \quad f(S) = (f_{\omega_1}, f_{\omega_2}, \ldots, f_{\omega_{n^k}}).$$


For DNA sequences, $\mathcal{A} = \{A, G, C, T\}$. If 2-words are considered and the words in the above vectors are arranged as (AA, AG, AC, AT, GA, GG, ..., TT), the corresponding vectors for S = AAAGGA are

N(S) = (2, 1, 0, 0, 1, 1, 0, ..., 0) and
f(S) = (0.4, 0.2, 0, 0, 0.2, 0.2, 0, ..., 0).
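The mapping from a sequence to its count and frequency vectors can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; the function names `kmer_counts` and `kmer_frequencies` are hypothetical, and windows containing letters outside the alphabet are simply skipped (an assumption the text does not address).

```python
from itertools import product

def kmer_counts(seq, k, alphabet="AGCT"):
    """Return (words, counts): all len(alphabet)**k words in a fixed order
    and the number of occurrences N_w of each word in seq."""
    if k < 1 or len(seq) < k:
        raise ValueError("need 1 <= k <= len(seq)")
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(words, 0)
    # Slide a window of width k once across the sequence.
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # assumption: skip windows with unknown letters
            counts[window] += 1
    return words, counts

def kmer_frequencies(seq, k, alphabet="AGCT"):
    """Return (words, freqs) with f_w = N_w / (l - k + 1)."""
    words, counts = kmer_counts(seq, k, alphabet)
    total = len(seq) - k + 1
    return words, {w: counts[w] / total for w in words}

# The worked example from the text: S = AAAGGA with 2-words.
words, counts = kmer_counts("AAAGGA", 2)
_, freqs = kmer_frequencies("AAAGGA", 2)
print(words[:6])
print([counts[w] for w in words[:6]])
print([freqs[w] for w in words[:6]])
```

With the alphabet string "AGCT", `itertools.product` enumerates the words in the same order as the text (AA, AG, AC, AT, GA, GG, ...), so the first six components can be compared directly with N(S) and f(S) above.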


To evaluate the distance between two sequences, it is intuitive to compute the norm of the difference between their corresponding frequency (or occurrence) vectors,

$$d(S_1, S_2) = \left( \sum_{i=1}^{n^k} \left| f_{1,\omega_i} - f_{2,\omega_i} \right|^p \right)^{1/p},$$

where $f_{1,\omega_i}$ and $f_{2,\omega_i}$ are the frequencies of the word $\omega_i$ in sequences $S_1$ and $S_2$, respectively. The norm gives mathematically well-defined distance functions for all positive values of $p$. Here $p = 1$ gives the Manhattan distance, which was used in [7,13]; $p = 2$ gives the Euclidean distance [14]; $p = \infty$ gives the max-norm (where only the largest absolute value contributes). However, these simple distances are not satisfactory for accurate phylogeny, because (i) they treat all word types equally, even though words have different background frequencies, and (ii) the contribution of a word may not merely be a polynomial function of the frequency difference. In order to overcome these problems, the Mahalanobis and standard Euclidean distances, which take into account the data covariance structure, were proposed for sequence comparison relatively recently [15]. In this paper, we propose two distance measures free of such problems by using a probabilistic framework.
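As a point of reference for the norm-based distances just described, the following Python sketch evaluates the Manhattan, Euclidean and max-norm distances between two frequency vectors. The function name `minkowski_distance` and the second vector `f2` are hypothetical and chosen only for illustration.

```python
import math

def minkowski_distance(f1, f2, p=2.0):
    """p-norm of the difference between two equally long frequency vectors.
    p = 1: Manhattan, p = 2: Euclidean, p = math.inf: max-norm."""
    if len(f1) != len(f2):
        raise ValueError("vectors must have the same length")
    if p <= 0:
        raise ValueError("p must be positive")
    diffs = [abs(a - b) for a, b in zip(f1, f2)]
    if math.isinf(p):
        # Only the largest absolute difference contributes.
        return max(diffs, default=0.0)
    return sum(d ** p for d in diffs) ** (1.0 / p)

# First six components of f(S) for S = AAAGGA, and a made-up second vector.
f1 = [0.4, 0.2, 0.0, 0.0, 0.2, 0.2]
f2 = [0.2, 0.2, 0.2, 0.0, 0.2, 0.2]

d_manhattan = minkowski_distance(f1, f2, p=1)
d_euclidean = minkowski_distance(f1, f2, p=2)
d_max = minkowski_distance(f1, f2, p=math.inf)
print(d_manhattan, d_euclidean, d_max)
```

Every word contributes through the same function of its frequency difference, regardless of how common the word is expected to be; this is limitation (i) noted above.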

The most immediate model for word occurrences is the binomial distribution, i.e., each word $\omega$ has the same probability $p$ of appearing at any word location. When $p$ is very small, the sequence length $l$ is sufficiently large, and the value of $lp$ is moderate, the occurrences of $\omega$ in this sequence approximately follow the Poisson distribution with parameter $lp$. In what follows we explore two distance metrics on the basis of the Poisson distribution of word occurrences.
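The quality of the Poisson approximation to the binomial model can be checked numerically. The sketch below is illustrative only: it assumes a sequence of length l = 30000 (roughly the size of the genomes studied later), word length k = 6, and a uniform word probability p = 4^(-6), none of which is prescribed by the text at this point.

```python
import math

def binomial_pmf(j, n, p):
    """Probability of j successes in n independent trials."""
    return math.comb(n, j) * p ** j * (1.0 - p) ** (n - j)

def poisson_pmf(j, lam):
    """Poisson probability of observing j events with mean lam."""
    return math.exp(-lam) * lam ** j / math.factorial(j)

# Illustrative values (assumptions, not taken from the paper's data).
l, k = 30000, 6
n = l - k + 1          # number of word locations
p = 4.0 ** (-k)        # uniform probability of one specific 6-word
lam = n * p            # Poisson parameter, about 7.3

# Largest pointwise gap between the two distributions over j = 0..39.
max_gap = max(abs(binomial_pmf(j, n, p) - poisson_pmf(j, lam)) for j in range(40))
print(lam, max_gap)
```

Note that the binomial model itself treats word locations as independent trials, which is only an approximation for overlapping windows.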


2.1. The relative Poisson distance 


For simplicity, we assume that $S_1$ and $S_2$ have the same length $l$ (or else we can normalize one of them). Occurrences of the word $\omega_i$ in these two sequences are denoted by $N_{1,\omega_i}$ and $N_{2,\omega_i}$, respectively. In the first step, we use $S_1$ to estimate the Poisson parameter. Since the parameter $\lambda$ of the Poisson model is equal to the expectation of the variable (word occurrence, in our model), we intuitively set $\lambda = N_{1,\omega_i}$. Then define

$$\mathrm{RP}_{\omega_i}(S_1, S_2) = \mathrm{Poi}(N_{2,\omega_i}; N_{1,\omega_i}) = \frac{e^{-N_{1,\omega_i}} \, N_{1,\omega_i}^{N_{2,\omega_i}}}{N_{2,\omega_i}!},$$

where $\mathrm{Poi}(k; \lambda)$ is the Poisson probability with parameter $\lambda$,

$$\mathrm{Poi}(k; \lambda) = \frac{e^{-\lambda} \lambda^{k}}{k!}. \qquad (1)$$

Actually, $\mathrm{RP}_{\omega_i}(S_1, S_2)$ measures a kind of 'similarity' between $S_1$ and $S_2$ in terms of the occurrences of $\omega_i$ (note that it is not a strict similarity measure as it is not symmetrical). Explicitly, low values of $\mathrm{RP}_{\omega_i}$ correspond to relatively large discrepancies in the occurrences of the word $\omega_i$, and the maximum value is attained when $N_{1,\omega_i} = N_{2,\omega_i}$ or $N_{1,\omega_i} = N_{2,\omega_i} + 1$. Taking all words into consideration, the final distance between $S_1$ and $S_2$ is defined as

$$d_{\mathrm{RP}}(S_1, S_2) = \sum_{i=1}^{n^k} \left( \mathrm{RP}_{\omega_i}(S_1, S_1) + \mathrm{RP}_{\omega_i}(S_2, S_2) - \mathrm{RP}_{\omega_i}(S_1, S_2) - \mathrm{RP}_{\omega_i}(S_2, S_1) \right). \qquad (2)$$

Here the two terms $\mathrm{RP}_{\omega_i}(S_1, S_1)$ and $\mathrm{RP}_{\omega_i}(S_2, S_2)$ are introduced to guarantee the positivity of $d_{\mathrm{RP}}(S_1, S_2)$ (note that $\mathrm{RP}_{\omega_i}(S_1, S_1) \ge \mathrm{RP}_{\omega_i}(S_1, S_2)$ for any word $\omega_i$).

Since $\mathrm{RP}_{\omega_i}(S_1, S_2)$ measures the probability of observing $N_{2,\omega_i}$ occurrences of $\omega_i$ in sequence $S_2$ given that the average occurrence is $N_{1,\omega_i}$, we refer to $d_{\mathrm{RP}}$ as the Relative Poisson distance between $S_1$ and $S_2$.

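A minimal Python sketch of Eq. (2) follows, assuming two sequences of equal length. The function names are hypothetical, and the Poisson probability is evaluated in log space with `math.lgamma` for numerical safety, which is a choice of this sketch rather than something stated in the text. Words absent from both sequences contribute 1 + 1 - 1 - 1 = 0 to the sum, so iterating over the observed words only is equivalent to summing over all n^k words.

```python
import math
from collections import Counter

def poisson_pmf(k, lam):
    """Poi(k; lam) as in Eq. (1), with the convention Poi(0; 0) = 1."""
    if lam == 0:
        return 1.0 if k == 0 else 0.0
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def word_counts(seq, k):
    """Occurrences of every k-word seen in seq (missing words count as 0)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def rp(n_model, n_observed):
    """RP term: probability of n_observed under a Poisson with mean n_model."""
    return poisson_pmf(n_observed, n_model)

def relative_poisson_distance(seq1, seq2, k):
    """d_RP of Eq. (2) for two sequences of the same length."""
    if len(seq1) != len(seq2):
        raise ValueError("this sketch assumes equal-length sequences")
    c1, c2 = word_counts(seq1, k), word_counts(seq2, k)
    total = 0.0
    for w in set(c1) | set(c2):
        n1, n2 = c1[w], c2[w]
        total += rp(n1, n1) + rp(n2, n2) - rp(n1, n2) - rp(n2, n1)
    return total

d_same = relative_poisson_distance("AAAGGA", "AAAGGA", 2)
d_12 = relative_poisson_distance("AAAGGA", "AGAGGA", 2)
d_21 = relative_poisson_distance("AGAGGA", "AAAGGA", 2)
print(d_same, d_12, d_21)
```

Although each RP term is asymmetric, the sum in Eq. (2) includes both RP(S1, S2) and RP(S2, S1), so the resulting distance is symmetric in its arguments.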
2.2. The distance based on expression level of words 


In the above subsection, we consider only one Poisson model: the parameter of this model is estimated from one sequence, and pairwise similarity is evaluated by the probability of generating the other sequence under this model. In this part, the occurrences of the word $\omega_i$ in sequences $S_1$ and $S_2$ follow two different Poisson distributions (with parameters $\lambda_{1,i}$ and $\lambda_{2,i}$, respectively). Define

$$\mathrm{Exp}_{1,\omega_i} = \sum_{k=0}^{N_{1,\omega_i}} \mathrm{Poi}(k; \lambda_{1,i}), \qquad (3)$$

where $N_{1,\omega_i}$ is the occurrence of $\omega_i$ in $S_1$. $\mathrm{Exp}_{1,\omega_i}$ is the probability of observing at most $N_{1,\omega_i}$ occurrences of $\omega_i$ in sequence $S_1$; $\mathrm{Exp}_{2,\omega_i}$ is defined analogously for $S_2$. Note that a word is called highly expressed if its observed frequency is more than its expected frequency, and lowly expressed otherwise. In this sense, the probability $\mathrm{Exp}_{1,\omega_i}$ measures a level of expression: a low value of $\mathrm{Exp}_{1,\omega_i}$ corresponds to low expression of the word $\omega_i$, and a large value corresponds to high expression of the word $\omega_i$ in sequence $S_1$. We define the final distance between $S_1$ and $S_2$ as

$$d_{\mathrm{Exp}}(S_1, S_2) = \sum_{i=1}^{n^k} \left| \mathrm{Exp}_{1,\omega_i} - \mathrm{Exp}_{2,\omega_i} \right|. \qquad (4)$$
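Eqs. (3) and (4) can be sketched as follows, taking the word counts and the Poisson parameters as given inputs. The names `poisson_cdf` and `expression_distance` and the example numbers are hypothetical. The direct summation used here starts from exp(-lambda) and therefore only suits moderate values of lambda; this is a limitation of the sketch.

```python
import math

def poisson_cdf(n, lam):
    """P(X <= n) for X ~ Poisson(lam): the sum of Poi(k; lam), k = 0..n (Eq. (3))."""
    if lam < 0 or n < 0:
        raise ValueError("n and lam must be non-negative")
    term = math.exp(-lam)        # Poi(0; lam)
    total = term
    for k in range(1, n + 1):
        term *= lam / k          # Poi(k; lam) = Poi(k-1; lam) * lam / k
        total += term
    return min(total, 1.0)       # guard against rounding slightly above 1

def expression_distance(counts1, lambdas1, counts2, lambdas2):
    """d_Exp of Eq. (4); all arguments are dicts keyed by word."""
    if set(lambdas1) != set(lambdas2):
        raise ValueError("both sequences need a parameter for every word")
    return sum(
        abs(poisson_cdf(counts1.get(w, 0), lambdas1[w])
            - poisson_cdf(counts2.get(w, 0), lambdas2[w]))
        for w in lambdas1
    )

# Made-up counts and Poisson parameters for two words in two sequences.
counts1, lambdas1 = {"AA": 2, "AG": 1}, {"AA": 1.5, "AG": 1.0}
counts2, lambdas2 = {"AA": 0, "AG": 1}, {"AA": 1.5, "AG": 1.0}

d = expression_distance(counts1, lambdas1, counts2, lambdas2)
print(d)
```

In this toy input only the word AA differs in count, so the distance reduces to the difference between the two cumulative probabilities for AA.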


Now, to compute $d_{\mathrm{Exp}}(S_1, S_2)$ we need to determine $\lambda_{1,i}$ and $\lambda_{2,i}$ for each word in each sequence. Note that the Poisson parameter for a word is actually its expected occurrence, which can be obtained immediately by multiplying the expected frequency (or background frequency) by the total number of words. We therefore only need to determine the background frequency of each word. To this end, two approaches are tried. The first approach corresponds to independence of nucleotides in the sequence, i.e., the background frequency of the word $\omega = \alpha_1\alpha_2\cdots\alpha_k$ is estimated by the product of the corresponding nucleotide frequencies in this sequence,

$$f^{0}_{\omega} = f_{\alpha_1} f_{\alpha_2} \cdots f_{\alpha_k}, \qquad (5)$$

where $f_{\alpha_i}$ ($i = 1, 2, \ldots, k$) is the frequency of the letter $\alpha_i$ in this sequence. An alternative method for estimating the background frequency of a word was proposed by Qi et al. [16], who applied a Markov model of DNA sequences of order $k - 2$. The expected frequency of a word is predicted from the probabilities of appropriate shorter subwords,

$$f^{0}_{\alpha_1\alpha_2\cdots\alpha_k} = \frac{f_{\alpha_1\alpha_2\cdots\alpha_{k-1}} \, f_{\alpha_2\alpha_3\cdots\alpha_k}}{f_{\alpha_2\alpha_3\cdots\alpha_{k-1}}}, \qquad (6)$$

where $f_{\alpha_1\alpha_2\cdots\alpha_{k-1}}$ is the frequency of the $(k-1)$-word $\alpha_1\alpha_2\cdots\alpha_{k-1}$ in the corresponding sequence, and the other subword frequencies are defined in the same way. Then, for each background frequency estimated by Eq. (5) or (6), the corresponding Poisson parameter is

$$\lambda_{\omega} = f^{0}_{\omega} \cdot (l - k + 1).$$
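The two background models of Eqs. (5) and (6) and the resulting Poisson parameter can be sketched as below. The function names are hypothetical, the Markov estimate is taken to be 0 when the middle (k-2)-word is absent (an assumption of this sketch), and subword frequencies are recomputed on every call for brevity rather than efficiency.

```python
from collections import Counter

def word_frequencies(seq, k):
    """Observed frequencies f_w = N_w / (l - k + 1) of the k-words in seq."""
    total = len(seq) - k + 1
    if k < 1 or total < 1:
        raise ValueError("need 1 <= k <= len(seq)")
    counts = Counter(seq[i:i + k] for i in range(total))
    return {w: c / total for w, c in counts.items()}

def background_independent(word, seq):
    """Eq. (5): product of the single-letter frequencies of the word's letters."""
    letter_freq = word_frequencies(seq, 1)
    f0 = 1.0
    for letter in word:
        f0 *= letter_freq.get(letter, 0.0)
    return f0

def background_markov(word, seq):
    """Eq. (6): Markov model of order k - 2, built from (k-1)- and (k-2)-words."""
    k = len(word)
    if k < 3:
        raise ValueError("Eq. (6) needs words of length at least 3")
    f_km1 = word_frequencies(seq, k - 1)
    f_km2 = word_frequencies(seq, k - 2)
    denom = f_km2.get(word[1:-1], 0.0)
    if denom == 0.0:
        return 0.0  # assumption: no estimate when the middle subword is absent
    return f_km1.get(word[:-1], 0.0) * f_km1.get(word[1:], 0.0) / denom

def poisson_parameter(f0, seq_len, k):
    """Expected occurrence: background frequency times the number of words."""
    return f0 * (seq_len - k + 1)

seq = "AAAGGA"
f0_ind = background_independent("AAG", seq)   # (4/6) * (4/6) * (2/6)
f0_mkv = background_markov("AAG", seq)        # f(AA) * f(AG) / f(A)
lam = poisson_parameter(f0_mkv, len(seq), 3)
print(f0_ind, f0_mkv, lam)
```

For the toy sequence AAAGGA the two models already disagree on the word AAG, which illustrates why the choice of background model can change the expression levels and hence the distance.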


The distance $d_{\mathrm{Exp}}$ has the following properties: (i) $d_{\mathrm{Exp}}(S_1, S_2) \ge 0$ and $d_{\mathrm{Exp}}(S_1, S_1) = 0$ for any sequences $S_1$ and $S_2$; (ii) background information (or frequencies of shorter words) is incorporated into the measurement; and (iii) words with identical frequency and occurrence in two sequences may still contribute to $d_{\mathrm{Exp}}$, because they may have different background frequencies and hence different expression levels.

When $\lambda$ is large ($\lambda > 50$), however, it is difficult to obtain an accurate Poisson probability by Eq. (1) using personal computers. Explicitly, as $e^{-\lambda}$ is very small and $\lambda^{k}$ is very large in the numerator, numerical errors may arise if they are multiplied directly. In order to overcome this difficulty, two practical approximations of the Poisson probability in the case of large $\lambda$ are tried: (i) Stirling formula. According to the Stirling formula, $k! \approx (k/e)^{k} \sqrt{2\pi k}$, so

$$\mathrm{Poi}(k; \lambda) = \frac{e^{-\lambda} \lambda^{k}}{k!} \approx \frac{(\lambda/k)^{k} \, e^{k-\lambda}}{\sqrt{2\pi k}}.$$

(ii) Normal approximation of the Poisson distribution. When $\lambda$ is sufficiently large, the Poisson distribution with parameter $\lambda$ can be approximated by the normal distribution $N(\lambda, \lambda)$.
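The two approximations can be compared against a reference value computed in log space. In the sketch below, the log-space evaluation via `math.lgamma` is an addition of this example (the text does not mention it), the function names are hypothetical, and the normal approximation is taken as the N(lambda, lambda) density at k without a continuity correction, which is also an assumption.

```python
import math

def poisson_pmf_log(k, lam):
    """Reference value of Poi(k; lam), evaluated in log space to avoid
    multiplying a tiny exp(-lam) by a huge lam**k (requires lam > 0)."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def poisson_pmf_stirling(k, lam):
    """Stirling-based approximation (lam/k)^k e^(k-lam) / sqrt(2 pi k), k >= 1."""
    return math.exp(k * math.log(lam / k) + k - lam) / math.sqrt(2.0 * math.pi * k)

def poisson_pmf_normal(k, lam):
    """Density of N(lam, lam) at k, used as an approximation to Poi(k; lam)."""
    return math.exp(-(k - lam) ** 2 / (2.0 * lam)) / math.sqrt(2.0 * math.pi * lam)

lam = 100.0
for k in (80, 100, 120):
    exact = poisson_pmf_log(k, lam)
    print(k, exact, poisson_pmf_stirling(k, lam), poisson_pmf_normal(k, lam))

# Relative errors at the mode k = lam = 100.
exact_mode = poisson_pmf_log(100, lam)
err_stirling = abs(poisson_pmf_stirling(100, lam) / exact_mode - 1.0)
err_normal = abs(poisson_pmf_normal(100, lam) / exact_mode - 1.0)
```

The relative error of the Stirling form is governed by the error of Stirling's formula itself, roughly 1/(12k), so it shrinks as k grows; the normal approximation is most accurate near the mean and degrades in the tails.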


3. Application 
3.1. Phylogenetic trees of 25 viruses including SARS-CoVs 


Coronaviruses are the causative agents of a number of mammalian diseases which often have significant economic and health-related consequences [17,18]. On the basis of antigenic cross-reactivity, coronaviruses were originally classified into three groups. Groups I and II contain mammalian viruses (group II coronaviruses additionally contain a hemagglutinin esterase gene homologous to that of Influenza C virus [19]), and group III contains only avian viruses. After the outbreak of severe acute respiratory syndrome coronavirus (SARS-CoV) in 2003, many efforts have been made to identify the position of SARS-CoVs in the coronavirus phylogeny. However, this is still a controversial topic: alignment-based methods showed that SARS-CoVs are not closely related to any previously isolated group and form a new group [20,21]; a maximum likelihood tree built from a fragment of the spike protein favored clustering SARS-CoVs with group II coronaviruses (murine hepatitis virus and rat coronavirus) [22]; while an information-based method, which made use of the whole genome sequences, indicated that the SARS-CoVs should not be classified as a new group but are close to the group I coronaviruses [23].

In this paper, we select 25 complete virus genomes: 12 coronaviruses from the three typical groups, 12 SARS-CoV strains, and a torovirus, which serves as the outgroup for coronaviruses [24] (data are shown in Table 1). In order to validate our method, distance matrices for the same data set are also constructed using some classical dissimilarity measures, e.g., the standard Euclidean distance [15,25], linear correlation coefficient [26], Kullback-Leibler (KL) discrepancy [3] and the Composition Vector approach [16,27]. Because the Kullback-Leibler discrepancy between two frequency vectors is not symmetrical and gives degenerate results when some word types are absent, we use a revised version, the Weighted Sequence Entropy (WSE) [28]. This modification behaves equivalently to the KL discrepancy in the case of short words, and effectively avoids the degeneracy for long words. The string Composition Vector (CV) approach proposed by Hao's group is a fast and efficient approach to whole genome comparison and phylogenetic analysis. For each k-string $\omega$, define

$$\mathrm{CV}_{\omega} = \begin{cases} \dfrac{f_{\omega} - f^{0}_{\omega}}{f^{0}_{\omega}}, & f^{0}_{\omega} \neq 0, \\ 0, & f^{0}_{\omega} = 0, \end{cases} \qquad (7)$$

where $f_{\omega}$ is the frequency of the word $\omega$ in a genomic sequence, and $f^{0}_{\omega}$ is its expected frequency under a certain background model (Markov model of order $k - 2$). The values $\mathrm{CV}_{\omega}$ for all possible $\omega$ are then collected as components to form a composition vector. The final distance between two species is evaluated based on the cosine function between their corresponding composition vectors.
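A compact sketch of Eq. (7) and a cosine-based distance is given below. The text only states that the distance is "based on the cosine function"; the specific form (1 - cos)/2 used here is an assumption of this sketch, as are the function names and the toy frequencies.

```python
import math

def composition_vector(freq, background):
    """Eq. (7): relative deviation of observed from expected frequency,
    set to 0 where the expected frequency is 0. Both inputs are dicts."""
    return {
        w: ((freq.get(w, 0.0) - f0) / f0 if f0 != 0 else 0.0)
        for w, f0 in background.items()
    }

def cosine_distance(cv1, cv2):
    """(1 - cos(cv1, cv2)) / 2, which lies in [0, 1] (assumed normalization)."""
    words = set(cv1) | set(cv2)
    dot = sum(cv1.get(w, 0.0) * cv2.get(w, 0.0) for w in words)
    norm1 = math.sqrt(sum(v * v for v in cv1.values()))
    norm2 = math.sqrt(sum(v * v for v in cv2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        raise ValueError("cosine is undefined for an all-zero vector")
    return (1.0 - dot / (norm1 * norm2)) / 2.0

# Made-up observed and expected frequencies for three words.
background = {"AAG": 0.12, "AGG": 0.10, "GGA": 0.0}
cv_a = composition_vector({"AAG": 0.25, "AGG": 0.05, "GGA": 0.25}, background)
cv_b = composition_vector({"AAG": 0.06, "AGG": 0.20, "GGA": 0.25}, background)

print(cv_a)
print(cosine_distance(cv_a, cv_b))
```

The word GGA has zero expected frequency in this toy input, so its component is set to 0 in both vectors, following the second case of Eq. (7).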

After calculating the pairwise distance matrices, phylogenetic trees for the 25 viruses are built by the UPGMA and NJ programs in the PHYLIP package. Rooted phylogenetic trees are then drawn by the TREEVIEW program [29]. The UPGMA tree built from the standard Euclidean distance is shown in Fig. 1(1). This tree supports torovirus as the outgroup of all coronaviruses, but fails to cluster the three group I coronaviruses: HCoV-229E and PEDV are grouped together, but TGEV is much closer to the SARS clade. Fig. 1(2) is the NJ tree constructed from the Euclidean distance. Similar to the UPGMA tree, this tree also clusters SARS-CoVs with TGEV. An obvious defect, however, is that it does not successfully cluster the eight group II coronaviruses. In Fig. 2, we show the trees built from the linear correlation coefficient between pairwise frequency vectors. Fig. 2(1) is the UPGMA tree. This tree perfectly clusters the species within each typical group and confirms the paraphyly of SARS-CoVs, but it fails to identify the outgroup status of torovirus relative to coronaviruses. The NJ tree (Fig. 2(2)), in which torovirus is selected as the outgroup species, confirms the adjacent relationship of SARS-CoVs with group I viruses. In the tree built from our distance measure $d_{\mathrm{Exp}}$ (Fig. 3), all the above defects are eliminated, i.e., the species of each typical group cluster together, and torovirus stays outside all coronaviruses including SARS-CoVs. Our tree shows that SARS-CoVs are not closely related to any previously isolated coronaviruses and form a new group, but it does not support the outgroup status of SARS-CoVs relative to other coronaviruses, as proposed by Zheng et al. [30]. This result is mainly in accordance with the WSE tree at word length k = 6 (Fig. 4) and the NJ tree constructed by the Composition Vector method (Fig. 5). Moreover, it is also supported by experimental evidence showing that antibodies specific to group I coronaviruses are able to recognize antigens in cultured cells infected with SARS-CoV [31].
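To pass a pairwise distance matrix to PHYLIP's distance-based programs, the matrix has to be written in PHYLIP's text layout: the number of taxa on the first line, then one row per taxon with a name occupying the first ten characters. The sketch below formats a square matrix this way; the function name and the toy distances are hypothetical, and the exact options of the tree-building programs are not covered here.

```python
def phylip_distance_matrix(names, matrix, decimals=6):
    """Format a square distance matrix in PHYLIP's square layout.
    Names are padded to 10 characters, as PHYLIP expects by default."""
    n = len(names)
    if len(matrix) != n or any(len(row) != n for row in matrix):
        raise ValueError("matrix must be square and match the number of names")
    lines = [f"{n:5d}"]
    for name, row in zip(names, matrix):
        if len(name) > 10:
            raise ValueError(f"name longer than 10 characters: {name}")
        values = " ".join(f"{d:.{decimals}f}" for d in row)
        lines.append(name.ljust(10) + " " + values)
    return "\n".join(lines) + "\n"

# Toy example with three of the abbreviations from Table 1 and made-up distances.
names = ["HCoV-229E", "TGEV", "PEDV"]
matrix = [
    [0.00, 0.12, 0.08],
    [0.12, 0.00, 0.15],
    [0.08, 0.15, 0.00],
]
text = phylip_distance_matrix(names, matrix)
print(text)
```

All abbreviations in Table 1 are at most ten characters long, so they fit PHYLIP's default name field without truncation.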


3.2. Whole mitochondrial genome phylogeny of 20 Eutherian mammals


In order to further validate our algorithm, we use the complete mtDNA sequences of 20 Eutherian mammals selected by Otu and Sayood as our second dataset [32]. This dataset consists of seven Primates, eight Ferungulates, two Rodents and three non-placental mammals. Their corresponding GenBank accession codes are as follows:


• Primates: human (Homo sapiens, V00662), common chimpanzee (Pan troglodytes, D38116), pygmy chimpanzee (Pan paniscus, D38113), gorilla (Gorilla gorilla, D38114), orangutan (Pongo pygmaeus, D38115), gibbon (Hylobates lar, X99256) and baboon (Papio hamadryas, Y18001).

• Ferungulates: horse (Equus caballus, X79547), white rhinoceros (Ceratotherium simum, Y07726), harbor seal (Phoca vitulina, X63726), gray seal (Halichoerus grypus, X72004), cat (Felis catus, U20753), fin whale (Balaenoptera physalus, X61145), blue whale (Balaenoptera musculus, X72204) and cow (Bos taurus, V00654).
Table 1
Coronaviruses and a torovirus used to construct the phylogenetic trees.

No.  Accession No.  Abbreviation  Genome                                    Group  Length (nt)
1    NC_002654      HCoV-229E     Human coronavirus 229E                    I      27317
2    NC_002306      TGEV          Transmissible gastroenteritis virus       I      28586
3    NC_003436      PEDV          Porcine epidemic diarrhea virus           I      28033
4    U00735         BCoVM         Bovine coronavirus strain Mebus           II     31032
5    AF391542       BCoVL         Bovine coronavirus isolate BCoV-LUN       II     31028
6    AF220295       BCoVQ         Bovine coronavirus strain Quebec          II     31100
7    NC_003045      BCoV          Bovine coronavirus                        II     31028
8    AF208067       MHVM          Murine hepatitis virus strain ML-10       II     31233
9    AF201929       MHV2          Murine hepatitis virus strain 2           II     31276
10   AF208066       MHVP          Murine hepatitis virus strain Penn 97-1   II     31112
11   NC_001846      MHV           Murine hepatitis virus                    II     31357
12   NC_001451      IBV           Avian infectious bronchitis virus         III    27608
13   AY278488       BJ01          SARS coronavirus BJ01                     -      29725
14   AY278741       Urbani        SARS coronavirus Urbani                   -      29727
15   AY278491       HKU-39849     SARS coronavirus HKU-39849                -      29742
16   AY278554       CUHK-W1       SARS coronavirus CUHK-W1                  -      29736
17   AY282752       CUHK-Su10     SARS coronavirus CUHK-Su10                -      29736
18   AY283794       SIN2500       SARS coronavirus SIN2500                  -      29711
19   AY283795       SIN2677       SARS coronavirus SIN2677                  -      29705
20   AY283796       SIN2679       SARS coronavirus SIN2679                  -      29711
21   AY283797       SIN2748       SARS coronavirus SIN2748                  -      29706
22   AY283798       SIN2774       SARS coronavirus SIN2774                  -      29711
23   AY291451       TW1           SARS coronavirus TW1                      -      29729
24   NC_004718      TOR2          SARS coronavirus                          -      29751
25   X52374         EToV          Equine torovirus                          -      7920

Fig. 1. (1) UPGMA and (2) Neighbor-Joining trees of 25 viruses (k = 6). Pairwise distances are evaluated by the standard Euclidean distance.


• Rodents: rat (Rattus norvegicus, X14848) and mouse (Mus musculus, V00711).

• Non-placental mammals: opossum (Didelphis virginiana, Z29573), wallaroo (Macropus robustus, Y10524) and platypus (Ornithorhynchus anatinus, X83427).


We applied the proposed distance measures to the complete mitochondrial genomes listed above. Fig. 6 shows the UPGMA tree constructed from the distance $d_{\mathrm{Exp}}$ with background frequencies estimated by Eq. (6). As seen from this figure, the three main groups of placental mammals, namely Primates, Ferungulates and Rodents, cluster accordingly, and the three non-placental mammals stay outside all other species. This topology is in perfect agreement with that given by Otu and Sayood except for the position of the rodents (mouse and rat). However, the relationship among the three main groups of placental mammals is still a controversial topic in molecular genetics [33]. Different types of molecular data and analysis methods result in different trees. By the maximum likelihood method, some proteins support the Ferungulates (Primates, Rodents) grouping while other proteins support the Rodents (Ferungulates, Primates) grouping [34], whereas our result suggests an alternative topology of Primates (Ferungulates, Rodents). In addition, we also applied some other word-based metrics mentioned above (the standard Euclidean distance, linear correlation coefficient and KL discrepancy) to the same dataset, but they did not give competitive results (not shown in this paper).

Fig. 2. (1) UPGMA and (2) Neighbor-Joining trees of 25 viruses (k = 6). Pairwise distances are evaluated by the linear correlation coefficient.

Fig. 3. Phylogenetic tree built by our distance $d_{\mathrm{Exp}}$ at k = 6, where the background frequency of each word is estimated by the product of the corresponding nucleotide frequencies.

Fig. 4. Phylogenetic tree built by the Weighted Sequence Entropy (WSE) at word length k = 6.

Fig. 5. (1) UPGMA and (2) Neighbor-Joining trees of 25 viruses built by the string Composition Vector approach.

Fig. 6. The UPGMA tree built from the complete mtDNA sequences of 20 mammals. We use the distance metric $d_{\mathrm{Exp}}$, and background frequencies of words are estimated by the Markov model of order k - 2.


4. Conclusion and discussion 


With the completion of many genome projects of Prokaryotes and Eukaryotes, genome-level phylogeny construction has become feasible and is expected to be more reliable than traditional analyses of only a single gene or a fragment of a genome. However, multiple sequence alignment of genomic sequences is still a bottleneck, first because of the computational time and second because of the inherent model assumptions. Therefore, there is a great need to develop new sequence comparison methods free of these problems. In recent years, a number of alignment-free methods based on, e.g., k-word frequencies [2], graphical representations [35-42], and information content [32,43] have been proposed. Nevertheless, compared to alignment methods, these methods are still at an early stage.

Sequence comparison based on the genomic composition of short words may be the most widely studied alignment-free method. It has relatively low computational complexity, and does not suffer greatly from genetic rearrangements and transposon activity, which are common mechanisms of genome evolution. In most cases, biological sequences are represented as occurrence or frequency vectors in a high-dimensional Euclidean space, and then the standard Euclidean distance, linear correlation coefficient, Kullback-Leibler (KL) discrepancy or cosine function between these vectors is calculated as a measure of dissimilarity. In this paper, we investigate two word-based distance measures in a probabilistic framework. Our hypothesis is that the occurrence of a given word in a random DNA sequence follows the Poisson distribution. The distance between two sequences is then evaluated by the probability of generating one sequence under the Poisson model estimated from the other, or by the different expression levels of words in the two sequences. In contrast to the traditional word-based distances, which use only frequencies of fixed-length words, our distances take background information of words (estimated from the frequencies of some shorter words or the corresponding nucleotide composition) into account. In other words, our method has the potential to adjust for background information in distance measures that use composition vectors. By constructing phylogenetic trees of 25 viruses including SARS-CoVs and of 20 Eutherian mammals, we find that our method gives more competitive results than existing word-based methods.

We note that each component $\mathrm{CV}_{\omega}$ of the string Composition Vector is also a measure of expression of the word $\omega$. In Eq. (7), the numerator $f_{\omega} - f^{0}_{\omega}$ is the deviation of the observed frequency from the expected value, and the denominator is introduced to eliminate the size effect. However, unlike our measure (Eq. (3)), the value of $\mathrm{CV}_{\omega}$ may be strongly affected by words with very low background frequency, i.e., when the expected frequency $f^{0}_{\omega}$ is very small, the corresponding $\mathrm{CV}_{\omega}$ will be very large. Our measure is free of this problem because it ranges from 0 to 1. In other words, our method can avoid the noise introduced by words with exceptional background frequencies.

However, compared to word-based measures that consider only composition vectors, our distances have relatively high computational costs. For example, occurrences of many words are much higher than 60 in some bacterial genomes (when k = 10), which makes our Poisson-based distances computationally infeasible. A reliable and efficient approximation of the Poisson probability is therefore critical to our method. In addition, the accuracy of our approach depends strongly on the Poisson model of word occurrences. This assumption is generally valid when the sequence length is sufficiently large. But for words with overlapping structure, e.g., TATATA and CCGCCG, the occurrences in a random sequence may deviate significantly from the Poisson distribution. At the same time, experiments showed that these self-overlapping words are more prone to be functional patterns in regular regions of genomes. In future work, we will explore models to describe and compare these words.